From zero to practice
Mail: javier.alvarez.liebana@upm.es.
Javier Álvarez Liébana from Carabanchel (Madrid).
Degree in Mathematics (UCM). PhD in Statistics (UGR).
Data visualization and analysis for the Principality of Asturias (2021-2022) during the COVID pandemic
Member of the Spanish Society of Statistics and OR and the Spanish Royal Mathematical Society.
Currently, Assistant Professor at Technical University of Madrid (UPM) and Visiting Researcher at Harvard University.
Take away the fear of programming → learn to program by programming
Understanding basic R concepts from scratch → learning to abstract ideas and algorithms
Utility of programming → reproducible, transparent and maintainable workflows.
Introduction to analysis and preprocessing of data → {tidyverse}.
Introduction to dataviz in R → {ggplot2}
To learn the fundamentals of statistics and Machine Learning. From descriptive analysis to prediction: building our first models.
Quarto. In the slide menu (bottom left), under Tools, there is an option to download the slides as a PDF.
workbooks folder.
cheatsheets-packages folder.
Introduction to R and RStudio. Working with projects. First uses of functions and packages. Basic data types
For the course, the only requirements will be:
We will program as we write
R)
RStudio) to write it
The R language will be our grammar and spelling (our rules of the game)
Step 1: go to https://cran.r-project.org/ and select your operating system.
Step 2: for Mac, simply click on the .pkg file, and open it once downloaded. For Windows systems, we need to click on install R for the first time and then on Download R for Windows. Once downloaded, open it like any installation file.
Step 3: open the installation executable.
Warning
Whenever you need to download something from CRAN (either R itself or a package), make sure you have an internet connection.
To check the installation, after opening R, you should see the R GUI (Graphical User Interface) with a white screen similar to this (console).
First code: we will assign the value 1 to a variable called a and another value to a variable called b (we will write the code in the console and press “enter” after each line). Then we will compute the sum a + b.
Note that…
In the console, a number [1] appears: it’s simply an element counter (like counting rows in Word)
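A minimal sketch of that first code, as it could be typed in the console. The value given to b is an assumption chosen only for illustration.

```r
# Assign values with the <- operator (the value of b is an illustrative assumption)
a <- 1
b <- 2

# Typing an expression prints its result, preceded by the element counter [1]
a + b
```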
RStudio will be the Word we will use to write (what is known as an IDE: Integrated Development Environment).
Step 1: go to the official RStudio website (now called Posit) and select the free download.
Step 2: select the executable that appears according to your operating system.
Step 3: after downloading the executable, open it like any other and let the installation finish.
When you open RStudio you will likely have three windows:
R is the evolution of the work of Bell Laboratories with the S language, which was brought into the open-source world by Ross Ihaka and Robert Gentleman in the 1990s. The version R 1.0.0 was released on February 29, 2000.
R is the statistical language par excellence, created by and for statisticians, with 6 fundamental advantages over Excel, SAS, Stata, or SPSS:
R community is to share code under copyleft → ethical use of spending and algorithms
Automate → it will allow you to automate recurring tasks.
Replicability → you will be able to replicate your analysis in the same way every time.
Flexibility → you will be able to adapt the software to your needs.
Transparency → to be audited by the community.
One of the key ideas of R is the use of packages: codes that other people have implemented to solve a problem
Once installed, there are two ways to use a package (take it off the shelf)
library(), using the package name without quotes, we load the whole book into the session
During your learning, it will be very common for things not to work out on the first try → you will be wrong. It will not only be important to accept it but also to read the error messages to learn from them.
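A small sketch of the two ways of using a package and of reading an error message. The {stats} package is used only because it ships with R; the package::function form is assumed to be the way whose description is missing above, and aPackageThatDoesNotExist is a made-up name.

```r
# Way 1: call one function with package::function, without loading the package
stats::median(c(1, 5, 9))

# Way 2: load the whole package with library() (name without quotes) ...
library(stats)
# ... and then call its functions directly
median(c(1, 5, 9))

# Asking for a package that is not installed produces an error worth reading.
# tryCatch() is used here only so that the script keeps running.
msg <- tryCatch(
  library(aPackageThatDoesNotExist),
  error = function(e) conditionMessage(e)
)
msg
```

A package that is not yet on the computer has to be installed once with install.packages("name"), this time with the name in quotes.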
A script will be the document in which we program, our .doc file (here with a .R extension) where we will write the commands. To open our first script, click on the menu in File < New File < R Script.
Be careful
It’s important not to overuse the console: everything you don’t write in a script, when you close, will be lost.
Be careful
R is case-sensitive: it is sensitive to uppercase and lowercase, so x and X represent different variables.
Now we have a fourth window: the window where we will write our codes. How do we run it?
Save current document.
Ctrl+Enter
Just as we usually work organized by folders on the computer, in RStudio we can do the same to work efficiently by creating projects.
A project will be a “folder” within RStudio, so our root directory will automatically be the project folder itself (allowing us to switch from one project to another using the top right menu).
We can create one in a new folder or in an existing folder.
📝 Create a course folder on your computer and set up an RStudio Project inside it. This will serve as your working directory for the entire course. After creating the project, you will see an .Rproj file. Within this folder, create two subfolders: data (for datasets) and scripts (for the .R files from each session).
📝 Inside the project create a script Exercises-class1.R (inside the scripts folder). Once created, define in it a variable named a and whose value is -1. Execute the code as you want
📝 Add below another line to define a variable b with the value 5. Then save the multiplication of both variables. Execute the code as you want.
📝 Modify the code below to define two variables c and d, with values 3 and -1. Then divide the variables and save the result.
📝 Assign to x a positive value and then compute its square root; assign to y a negative number and compute its absolute value using abs().
Note that…
Commands like sqrt(), abs() or max() are what we call functions: lines of code that we have “encapsulated” under a name, and given some input arguments, execute the commands (a sort of shortcut). In the functions the arguments will ALWAYS be enclosed in parentheses
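As a quick illustration of calling functions with their arguments in parentheses, here is a sketch with made-up values for x and y.

```r
x <- 16    # made-up values for illustration
y <- -3

sqrt(x)      # square root of x
abs(y)       # absolute value of y
max(x, y)    # the largest of the values given
```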
📝 Using the variable x already defined, complete/modify the code below to store in a new variable z the result stored in x minus 5.
📝 Define an x variable and assign it the value -1. Define another y and assign it the value 0. Then perform the operations a) x by y; b) square root of x. What do you get?
What data type can we have in each cell of a table?
Before we continue, it’s important to know something as soon as possible: starting with programming can be frustrating
Just like when learning a new language, the first obstacle is not so much what to say but how to say it correctly. The same goes for R, so let’s standardize our programming style as much as possible to avoid future errors.
R does not process spaces).
snake_case.
Tools < Global Options, you can customize some options in RStudio. In Code < Display, you can set Show margin to display an “imaginary” margin (not interacting with the code) to “force” you to make line breaks.
RStudio, there’s a wonderful tool: if you type part of a variable or function name and press tab, RStudio will autocomplete it for you.
Tools < Global Options < Code < Display and enable the Rainbow parentheses option.
RStudio will notify you.
class1.R in the project we created before)
See more tips at https://r4ds.had.co.nz/workflow-basics.html#whats-in-a-name
Are there variables beyond numbers in data science? For example, think about the data you might store about a person:
TRUE if enrolled or FALSE otherwise).
The simplest data type (which we have already used) is the numeric variable. To find out the class of a variable in R, we use the class() function.
To find out how a variable is stored internally (its format), we use typeof().
[1] "double"
[1] "integer"
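A sketch of how class() and typeof() behave for the two kinds of numbers. The variable names and values are illustrative assumptions.

```r
age <- 32          # a plain number is stored as a double
class(age)         # "numeric"
typeof(age)        # "double"

n_siblings <- 2L   # the L suffix asks R for an integer
class(n_siblings)  # "integer"
typeof(n_siblings) # "integer"
```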
Note that…
In R we have a collection of functions starting with as.x() that serve as conversion functions: they take data of one type and convert it to type x.
In addition to the “common” numbers we will have the plus/minus infinity coded as Inf or -Inf.
With numeric variables we can perform the arithmetic operations of a calculator: adding (+)…
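A few illustrative lines on conversions, infinite values and calculator-style arithmetic. The numbers are made up.

```r
# Conversion functions
as.integer(3.9)     # 3: the decimal part is dropped
as.character(32)    # "32": now it is text
as.numeric("1.5")   # 1.5: text converted to a number

# Infinite values
1 / 0               # Inf
-1 / 0              # -Inf

# Arithmetic as in a calculator
(5 + 3) * 2 / 4     # 4
2^3                 # 8
```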
Let us imagine that, in addition to the age of a person we want to store his/her name: now the variable will be of type character.
The text strings are a type with which we obviously cannot perform arithmetic operations (other operations such as pasting or locating patterns can be performed).
Reminder
Text variables (character or string) are ALWAYS in quotes: TRUE (logical, binary value) is not the same as "TRUE" (text).
As mentioned, in R we call a function a piece of code encapsulated under a name that depends on some input arguments. Our first function will be paste(): given two strings, it pastes them together.
Note that by default paste() joins the strings with a space, but we can add an optional argument to set the separator (sep = ...).
Remember that functions are always written as name_of_function(arguments), whereas we will use [i] to access the i-th element.
How do I know what arguments a function needs?
By typing ? paste in the console, you will get help in the multipurpose panel, where the header shows which arguments the function takes and which of them already have default values assigned.
The arguments (and their details) can also be consulted by pressing tab (after a comma).
It is very important to understand the concept of a default argument of a function in R: it is a value that the function uses but that we may not see, because it already has a value assigned.
[1] "Javi Álvarez"
[1] "Javi Álvarez"
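A sketch of the default argument of paste(): the first two calls should give the same result, because sep = " " is used even when it is not written. The variable names are assumptions, and the "*" separator is just an example.

```r
name <- "Javi"
surname <- "Álvarez"

# Default behaviour: sep = " " is applied even though we do not write it
paste(name, surname)

# Writing the default explicitly gives the same result
paste(name, surname, sep = " ")

# Changing the optional argument changes the separator
paste(name, surname, sep = "*")
```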
Note
The = operator is reserved for assigning arguments within functions. For all other assignments, we will use <-.
A more intuitive way to work with text is to use the {glue} package: the first thing to do is to “buy the book” (if we have never done it before). After that, we load the package.
With the glue() function of that package we can use variables inside strings. For example, “age is … years old”, where the age is stored in a variable.
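A minimal sketch of glue(), assuming the {glue} package is already installed (install.packages("glue") would be needed once otherwise). The value of age is made up.

```r
library(glue)

age <- 32   # illustrative value

# Whatever is inside the braces is replaced by the value of the variable
msg <- glue("age is {age} years old")
msg

# Expressions inside the braces are evaluated too
glue("in ten years, age will be {age + 10} years old")
```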
Another fundamental type will be the logical or binary variables (two values):
TRUE: true stored internally as a 1.
FALSE: false stored internally as a 0.
As we will see shortly, logical variables can actually take a third value: NA or missing data, representing not available, and it will be very common to find it within a database.
Logical values are usually the result of evaluating logical conditions. For example, imagine that we want to check whether a person is named Javi.
With the logical operator == we ask whether what is stored on the left is the same as what is on the right: we ASK.
Note that…
It is not the same <- (assignment) as == (we are asking, it is a logical comparison).
In addition to “equal to” versus “different” comparisons, we also have order comparisons such as less than <, greater than >, <= or >=. Is the person less than 32 years old?
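The comparisons described above can be sketched as follows. The values stored in name and age are illustrative assumptions.

```r
name <- "Javi"
age <- 32

name == "Javi"   # TRUE: we ask whether name is equal to "Javi"
name != "Javi"   # FALSE: "different from"
age < 32         # FALSE: 32 is not strictly less than 32
age <= 32        # TRUE

# TRUE and FALSE behave as 1 and 0 in arithmetic
TRUE + TRUE      # 2
```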
A very special data type: the date type data.
It looks like a simple text string but should represent an instant in time. What should happen if we add a 1 to a date?
Dates cannot be string/text: we must convert the text string to date.
Once installed, we tell R to load this one in particular out of all the packages (books) we have.
To convert to date type we will use the as_date() function of the {lubridate} package (default in yyyy-mm-dd format).
In as_date() the default date format is yyyy-mm-dd so if the string is not entered correctly…
For any other format we must specify it in the optional argument format = ..., where %d represents days, %m months, %Y years with 4 digits and %y years with 2 digits.
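A sketch of the text-to-date conversion. To keep it runnable without extra packages it uses base R's as.Date() as a stand-in for as_date() from {lubridate}; the format codes are the same. The date itself is made up.

```r
birth_text <- "1989-09-10"         # illustrative date; for now it is just text
class(birth_text)                  # "character"

birth_date <- as.Date(birth_text)  # default format is yyyy-mm-dd
class(birth_date)                  # "Date"
birth_date + 1                     # the next day

# Any other format must be declared with the format codes
as.Date("10/09/1989", format = "%d/%m/%Y")
as.Date("10-09-89", format = "%d-%m-%y")

# Days elapsed from that date until today
Sys.Date() - birth_date
```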
In this package we have very useful functions for date management:
today() we can directly obtain the current date.
More information
You have a pdf summary of the most important packages in the corresponding folder on campus
Try to perform the following exercises without looking at the solutions
📝 Define a variable that stores your age (called age) and another with your name (called name).
📝 Check with this variable age if it is NOT 60 years old or if it is called "Ornitorrinco" (you must obtain logical variables as a result).
📝 Why does the code below not produce an error?
📝 Define another variable called siblings that answers the question “do you have siblings?” and another variable that stores your date of birth (called birth_date).
📝 Define another variable with your last name (called surname) and use glue() to have, in a single variable called full_name, your first and last name separated by a comma.
📝 Calculate the days that have passed since your birth date until today (with the birth date defined in Exercise 4).
📝 Why does the code below give an error?
📝 Why does the code below not produce an error?
📝 What do you think is stored in the variable “healthy” below?
Concatenating cells: vectors. First databases
When working with data, we often have columns that represent variables: we will refer to these as vectors, which are a concatenation of cells (values) of the same type (similar to a column in a table).
The simplest way to create a vector is with the c() function (c stands for concatenate), and you just need to input the elements within parentheses, separated by commas.
Tip
An individual number x <- 1 (or x <- c(1)) is actually a vector of length one –> everything we know how to do with a number, we can do with a vector of numbers.
The most common type of vector is numeric, specifically, the well-known numeric sequences (e.g., the days of the month), used among other things, to index loops.
The seq(start, end) function allows us to create a numeric sequence from a starting element to an ending one, advancing one by one.
A shortcut is the 1:n command, which returns the same as seq(1, n).
If the starting element is greater than the ending one, it understands that the sequence is in descending order.
Sometimes we may want to define a sequence with a specific length.
[1] 1.000000 9.166667 17.333333 25.500000 33.666667 41.833333 50.000000
We might also want to generate a vector of n repeated elements.
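The sequence tools above can be sketched as follows. The argument length.out of seq() and the function rep() are the standard base R tools for these two tasks.

```r
seq(1, 10)                    # 1 2 ... 10, advancing one by one
1:10                          # shortcut for the same sequence
seq(7, -3)                    # start greater than end: descending sequence
seq(1, 50, length.out = 7)    # 7 equally spaced values between 1 and 50
rep(0, 5)                     # the value 0 repeated 5 times
```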
A vector is a concatenation of elements of the same type, but they don’t necessarily have to be numbers. Let’s create a sample sentence.
What will happen if we concatenate elements of different types?
Note that since all elements must be of the same type, what R does is convert everything to text, violating the data integrity.
With numeric vectors, we can perform the same arithmetic operations as with numbers → a number is a vector (of length one).
What will happen if we add or subtract a value to a vector?
Vectors can also interact with each other, so we can define, for example, vector sums (element by element).
Since the operation (e.g., a sum) is performed element by element, what will happen if we add two vectors of different lengths?
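A sketch answering the questions above with made-up vectors: operations are applied element by element, a shorter vector is recycled (repeated) to match the longer one, and mixed types are converted to text.

```r
x <- c(1, 3, 5, 7)
x + 1                 # the number is added to every element: 2 4 6 8
x * 2                 # 2 6 10 14

y <- c(10, 20, 30, 40)
x + y                 # element by element: 11 23 35 47

# Different lengths: the shorter vector is repeated until it fits
x + c(100, 200)       # 101 203 105 207

# Mixing types: everything is converted to text
c(1, "a", TRUE)       # "1" "a" "TRUE"
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning.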
A very common operation is to ask questions of the data using logical conditions. For example, if we define a vector of temperatures…
Which days were below 22 degrees?
This will return a logical vector, depending on whether each element meets the given condition (of the same length as the vector being queried).
Logical conditions can be combined in two ways:
&) to return TRUE.
|).
Another common operation is accessing or getting elements. The simplest way is to use the [i] operator (access the i-th element).
Since a number is just a vector of length one, this operation can also be applied using a vector of indices to select.
Tip
To access the last element without worrying about its position, you can pass the vector’s length as the index x[length(x)].
Sometimes, instead of selecting, we may want to remove elements. This is done with the same operation but using negative indexing: the operator [-i] «un-selects» the i-th element.
[1] "hi" "how" "are" "you" "?"
[1] "hi" "are" "you" "?"
In many cases, we want to select or remove elements based on logical conditions, depending on the values, so we will pass the condition itself as the index (remember, x < 2 returns a logical vector).
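The different ways of selecting and removing elements can be sketched as follows. The age values are made up for illustration.

```r
x <- c("hi", "how", "are", "you", "?")
x[2]             # second element: "how"
x[c(1, 3)]       # several elements at once
x[length(x)]     # last element, whatever the length
x[-2]            # everything except the second element

# Made-up ages to illustrate selection by condition
age <- c(15, 20, 30, 17, 45)
age[age >= 18]               # keep only the elements meeting the condition
age[age >= 18 & age < 40]    # both conditions must hold (and)
age[age < 18 | age > 40]     # at least one condition holds (or)
```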
We can also make use of statistical operations, such as sum(), which, given a vector, returns the sum of all its elements.
What happens when a data point is missing?
As we’ve mentioned, logical values are internally stored as 0 and 1, so we can use them in arithmetic operations.
For example, if we want to find out the number of elements that meet a condition (e.g., less than 3), those that do will be assigned a 1 (TRUE), and those that don’t will get a 0 (FALSE). Therefore, summing the logical vector will give us the number of elements that meet the condition.
Another common operation that can be useful is the cumulative sum with cumsum(), which, given a vector, returns a vector where each element is the sum of the first, the first plus the second, the first plus the second plus the third, and so on.
What happens when a data point is missing?
In the case of the cumulative sum, what happens is that from that point onward, all subsequent accumulated values will be missing.
Another common operation that can be useful is the lagged difference with diff() which, given a vector, returns a vector with the second minus the first, the third minus the second, the fourth minus the third, and so on.
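A sketch of these operations with and without a missing value. The vectors are made up, and na.rm = TRUE is a standard optional argument of sum() that is not covered in the text above.

```r
x <- c(7, 2, NA, 4, 1)

sum(x)                  # NA: one missing value makes the total unknown
sum(x, na.rm = TRUE)    # 14: missing values are dropped first
cumsum(x)               # 7 9 NA NA NA: missing from the NA onward

y <- c(7, 2, 5, 4, 1)
sum(y < 3)              # 2 elements are below 3
cumsum(y)               # 7 9 14 18 19
diff(y)                 # -5 3 -1 -3
```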
Other common operations are mean, median, percentiles, etc.
Finally, a common action is sorting values:
sort(): returns the sorted vector. By default from smallest to largest, but with decreasing = TRUE we can change it.
[1] 7 20 23 25 33 41 65 77 81
[1] 81 77 65 41 33 25 23 20 7
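A sketch that reproduces the two outputs above. The original order of the values is an assumption, and order() is shown as the usual companion of sort() in base R.

```r
# Same values as in the output above; their original order is assumed
x <- c(33, 7, 81, 20, 65, 25, 41, 23, 77)

sort(x)                      # from smallest to largest
sort(x, decreasing = TRUE)   # from largest to smallest
order(x)                     # positions that would sort the vector
x[order(x)]                  # same result as sort(x)
```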
Try to perform the following exercises without looking at the solutions
📝 Define the vector x as the concatenation of the first 5 odd numbers. Calculate the length of the vector
📝 Access the third element of x. Access the last element (regardless of length, a code that can always be executed). Delete the first element.
📝 Get the elements of x greater than 4. Calculate the vector 1/x and store it in a variable.
📝 Create a vector representing the names of 5 people, one of whom is unknown.
📝 Find from the vector x of exercises above the elements greater (strictly) than 1 and less (strictly) than 7. Find a way to find out if all the elements are positive or not.
📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), why does its mean return not a number but what is shown in the code below?
📝 Given the vector x <- c(1, -5, 8, NA, 10, -3, 9), extract the elements occupying the locations 1, 2, 5, 6.
📝 Given the x vector of the previous exercise, which ones have a missing data? Hint: the is.something() functions check if the element is of type something (press tab).
📝 Define the vector x as the concatenation of the first 4 even numbers. Calculate the number of elements of x strictly less than 5.
📝 Calculate the vector 1/x and obtain the ordered version (from smallest to largest) in the two possible ways
When analyzing data we usually have several variables for each individual: we need a “table” to collect them. The most immediate option is matrices: concatenation of variables of same type and equal length.
Imagine we have heights and weights of 4 people. How to create a dataset with the two variables?
We can also build the matrix by rows with the rbind() function (concatenate - bind - by rows - r), although it is recommended to have each variable in column and individual in row as we will see later.
View(matrix).
We can also “flip” (transposed matrix) with t().
In some cases we will want to get the total data for an individual (a particular row but all columns) or the values of a whole variable for all individuals (a particular column but all rows). To do so, we leave one of the indexes unfilled.
We can also define a matrix from a numeric vector, rearranging the values in the form of a matrix (knowing that the elements are placed by columns).
With matrices it is the same as with vectors: when we apply an arithmetic operation we do it element by element
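The matrix operations described so far can be sketched as follows. The heights and weights are made up, and cbind() is assumed as the column-wise counterpart of rbind().

```r
# Made-up heights (cm) and weights (kg) of 4 people
heights <- c(150, 160, 170, 180)
weights <- c(50, 60, 70, 80)

data_matrix <- cbind(heights, weights)   # each variable in a column
dim(data_matrix)                         # 4 rows, 2 columns
data_matrix[2, ]                         # second individual, all variables
data_matrix[, 1]                         # first variable, all individuals
t(data_matrix)                           # transposed: 2 rows, 4 columns
data_matrix / 10                         # arithmetic is element by element

matrix(1:6, ncol = 2)                    # values are placed by columns
```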
We can also perform operations by columns/rows without loops with the apply() function, and we will indicate as arguments
MARGIN = 1 for rows, MARGIN = 2 for columns)
Try to perform the following exercises without looking at the solutions
📝 Modify the code below to define an x matrix of ones, with 3 rows and 7 columns.
📝 To the above matrix, add 1 to each number in the matrix and divide the result by 5. Then calculate its transpose
📝 Why does the code below return such a warning message?
📝 Define the matrix x <- matrix(1:12, nrow = 4). Then get the data of the first individual, the data of the third variable, and the element (4, 1).
📝 Define a matrix of 2 variables and 3 individuals such that each variable captures the height and age of 3 persons, so that the age of the second person is unknown (absent). Then calculate the mean of each variable (we should get a number!).
Matrices have the same problem as vectors: if we put together data of different types, data integrity is compromised because R converts them (see the code below: the ages and the TRUE/FALSE are converted to text).
In order to work with variables of different type we have in R what is known as data.frame: concatenation of variables of equal length but which can be of different type.
Since a data.frame is already an attempt at a database the variables are not mere mathematical vectors: they have a meaning and we can (we must) give them names that describe their meaning.
We have our first data set! (strictly speaking we can’t talk about a database but for the moment it looks like one). You can visualize it by typing its name in console or with View(table).
If we want to access its elements, being again tabulated data, we can access as in the matrices (not recommended): again we have two indexes (rows and columns, leaving free the one we don’t use)
ages single names birth_date
2 24 NA laura 1992-04-01
[1] "javi" "laura" "lucía"
[1] 24
But it also has the advantages of a database : we can access the variables by name (recommended since the variables can change position and now they have a meaning), putting the name of the table followed by the symbol $ (with the tab, a menu of columns to choose from will appear).
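A sketch of a data.frame consistent with the output shown above. Only the second row (24, NA, "laura", 1992-04-01) and the three names come from that output; every other value is made up for illustration.

```r
# Values not visible in the output above are illustrative assumptions
table <- data.frame(
  ages = c(33, 24, 51),
  single = c(TRUE, NA, FALSE),
  names = c("javi", "laura", "lucía"),
  birth_date = as.Date(c("1989-09-10", "1992-04-01", "1980-11-30"))
)

table[2, ]      # second individual, all variables
table[, 3]      # third variable, accessed by position (not recommended)
table[2, 1]     # 24
table$names     # the same variable, accessed by name (recommended)
names(table)    # variable names
```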
names(): shows us the variable names
If we have one already created and we want to add a column it is as simple as using the data.frame() function we have already seen to concatenate the column. Let’s add for example a new variable, the number of siblings of each individual.
Try to perform the following exercises without looking at the solutions
📝 Load from the {datasets} package the airquality dataset (New York air quality variables from May through September 1973). Is the airquality dataset of type tibble? If not, convert it to tibble (look in the package documentation at https://tibble.tidyverse.org/index.html).
📝 Once converted to tibble get the name of the variables and the dimensions of the data set. How many variables are there? How many days have been measured?
📝 Filter only the data for the month of August. How to tell it that we want only the rows that meet a specific condition?
📝 Select those data that are not from July or August.
📝 Modify the following code to keep only the ozone and temperature variables (no matter what position they are).
📝 Select the temperature and wind data for August.
The National Health and Nutrition Examination Survey (NHANES) is a large, nationally representative program conducted in the United States to assess the health and nutritional status of adults and children. NHANES combines interviews, physical examinations, and laboratory measurements. NHANES is widely used in epidemiology, public health research, and policy analysis.
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
1 51624 2009_10 male 34 30-39 409 White <NA> High School
2 51624 2009_10 male 34 30-39 409 White <NA> High School
3 51624 2009_10 male 34 30-39 409 White <NA> High School
4 51625 2009_10 male 4 0-9 49 Other <NA> <NA>
5 51630 2009_10 female 49 40-49 596 White <NA> Some College
6 51638 2009_10 male 9 0-9 115 White <NA> <NA>
MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn
1 Married 25000-34999 30000 1.36 6 Own
2 Married 25000-34999 30000 1.36 6 Own
3 Married 25000-34999 30000 1.36 6 Own
4 <NA> 20000-24999 22500 1.07 9 Own
5 LivePartner 35000-44999 40000 1.91 5 Rent
6 <NA> 75000-99999 87500 1.84 6 Rent
Try to answer the questions posed in the workbook intro-R
In the {datasets} package (already installed by default) we have several datasets and one of them is airquality. Below I have extracted 3 variables from that dataset (note that it is done with data$variable; that dollar will be important in the future).
The data captures daily measurements (n = 153 observations) of air quality in New York, from May to September 1973. Six variables were measured: ozone levels, solar radiation, wind, temperature, month and day.
Try to answer the questions posed in the workbook intro-R
We will consider the surveys.RData file in which we have all poll surveys for Spain from 1982 to 2019.
Try to answer the questions posed in the workbook intro-R
Welcome to tidyverse. First actions against databases
Tables in data.frame format have some limitations. The main one is that they do not allow recursion: imagine that we define a database with heights and weights, and we want a third variable with the BMI.
Error in data.frame(height = c(1.7, 1.8, 1.6), weight = c(80, 75, 70), : object 'weight' not found
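A sketch that reproduces this limitation in base R and shows one possible workaround. It assumes that no objects called height or weight already exist in the session; tryCatch() is used only to capture the error message instead of stopping.

```r
# data.frame() evaluates every argument before the columns exist,
# so a column cannot be built from its neighbours in the same call
# (assumes no objects named height or weight exist in the session)
result <- tryCatch(
  data.frame(height = c(1.7, 1.8, 1.6), weight = c(80, 75, 70),
             BMI = weight / (height^2)),
  error = function(e) conditionMessage(e)
)
result   # the captured error message

# Base R workaround: create the table first, then add the new column
data_df <- data.frame(height = c(1.7, 1.8, 1.6), weight = c(80, 75, 70))
data_df$BMI <- data_df$weight / (data_df$height^2)
data_df
```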
Hereafter we will use the tibble (enhanced data.frame) format from the {tibble} package.
data_tb <-
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
class(data_tb)
[1] "tbl_df" "tbl" "data.frame"
# A tibble: 3 × 3
height weight BMI
<dbl> <dbl> <dbl>
1 1.7 80 27.7
2 1.8 75 23.1
3 1.6 70 27.3
Tables in tibble format allow a more agile, efficient and consistent management of data, with 4 main advantages:
tribble().
Tip
The {datapasta} package allows us to copy and paste tables from web pages and simple documents as a tribble. See more in https://milesmcbain.github.io/datapasta/articles/how-to-datapasta.html#pasting-a-table-as-a-formatted-tibble-definition-with-tribble_paste
R by default operations are done element to element.
R are vectors: a concatenation of values of the SAME TYPE
() or []?
Reminder that…
x <= 0
x[x <= 0] or y[x <= 0]
sum(x <= 0)
mean(x <= 0)
all(x <= 0) or any(x <= 0)
[1] -6 -4 0 1 2
[1] 2 1 0 -4 -6
Reminder that something() means a function and arguments are inside of (): there are optional arguments that modify the default mode of functions.
Our final database format will be the tibble type object, an enhanced data.frame.
library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
# A tibble: 3 × 3
height weight BMI
<dbl> <dbl> <dbl>
1 1.7 80 27.7
2 1.8 75 23.1
3 1.6 70 27.3
Metainformation: in the header it automatically tells us the number of rows and columns, and the type of each variable.
Recursivity: allows to define the variables sequentially (as we have seen).
Consistency: if you access a column that does not exist, it gives you a warning.
To define a tibble() ourselves we have 3 options:
tibble() function of the {tibble} package (already included in {tidyverse})
# A tibble: 3 × 3
height weight BMI
<dbl> <dbl> <dbl>
1 1.7 80 27.7
2 1.8 75 23.1
3 1.6 70 27.3
or … 3. import from an Excel/csv (we will see, be patient <3).
So far, everything we have done in R has been done in the programming paradigm known as R base. When R was born as a language, many of those who programmed in it imitated forms and methodologies inherited from other languages, based on the use of
Loops for and while
Dollar $ to access to the variables
Structures if-else
And although knowing these structures can be interesting in some cases, in most cases they are obsolete and we will be able to avoid them (especially loops) since R is specially designed to work in a functional way (instead of element-by-element).
In this context of functional programming, a decade ago {tidyverse} was born, a “universe” of packages to guarantee an efficient, coherent and lexicographically simple to understand workflow, based on the idea that our data is clean and tidy.
{lubridate}: date management
{rvest}: web scraping
{tidymodels}: modeling/prediction
{tibble}: optimizing data.frame
{tidyr}: data cleaning
{readr}: load rectangular data (.csv)
{readxl}: import .xls and .xlsx files
{dplyr}: grammar for data manipulation
{stringr}: text handling
{purrr}: list handling
{forcats}: handling of qualitative variables
{ggplot2}: data visualization
Tidy datasets are all alike, but every messy dataset is messy in its own way (Hadley Wickham, Chief Scientist at RStudio)
TIDYVERSE
The universe of {tidyverse} packages is based on the idea introduced by Hadley Wickham (the God we pray to) of standardizing the format of data to
The first thing will therefore be to understand what tidy data sets are, since the whole {tidyverse} is based on the data being standardized.
In {tidyverse} the pipe operator, written |> (Ctrl+Shift+M), will be key: it will be a pipe that traverses the data and transforms it.
In R base, if we want to apply three functions first(), second() and third() in order, it would be
Important
Since version 4.1.0 of R we have |>, a native pipe available outside tidyverse, replacing the old pipe %>% which depended on the {magrittr} package (quite problematic).
The main advantage is that the code is very readable (almost literal) and you can do large operations on the data with very little code.
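A sketch comparing nested calls with the native pipe. The functions abs(), sqrt() and sum() stand in for the generic first(), second() and third(), and the values of x are made up. It requires R 4.1.0 or later.

```r
x <- c(-4, 9, -16)   # made-up values

# Nested calls: they must be read from the inside out
sum(sqrt(abs(x)))

# With the native pipe: read from left to right, in the order applied
x |> abs() |> sqrt() |> sum()
```

Both lines compute the same value; the piped version lists the steps in the order in which they happen.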
Before in R base: get the Ozone and Temperature variables from July
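One way this could look in base R, as a sketch: rows are chosen with a logical condition and columns by name, all inside the same brackets. The name july is an assumption.

```r
# airquality ships with R in the {datasets} package
# Rows: Month equal to 7; columns: Ozone and Temp
july <- airquality[airquality$Month == 7, c("Ozone", "Temp")]

head(july)
nrow(july)   # 31 rows, one per day of July
```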
Within {tidyverse} we will use the {dplyr} package for data preprocessing.
The idea is that the code is as readable as possible, as if it were a list of instructions that when read tells us in a very obvious way what it is doing.
All the preprocessing we are going to perform assumes that our data is in tidy data format.
Remember that in {tidyverse} the pipe operator defined as |> (ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.
One of the most common operations is what is known in statistics as sampling: a selection or filtering of records (rows) (a subsample).
filter()).
slice()).
slice_sample()).
group_by() + slice_sample()).
The simplest action by rows is to filter records based on some logical condition: with filter() only individuals meeting certain conditions will be selected (non-random sampling by conditions).
==, !=: equal or different to (|> filter(variable == "a"))
>, <: greater or less than (|> filter(variable < 3))
>=, <=: greater or equal or less or equal than (|> filter(variable >= 5))
%in%: values belong to a set of discrete options (|> filter(variable %in% c("blue", "green")))
between(variable, val1, val2): if continuous values are inside of a range (|> filter(between(variable, 160, 180)))
These logical conditions can be combined in different ways (and, or, or mutually exclusive).
Important
Remember that inside filter() there must always be something that returns a vector of logical values.
How would you go about… filter the characters with brown eyes?
What type of variable is it? –> The eye_color variable is qualitative so it is represented by texts.
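Since eye_color is text, the value we compare against goes in quotes. A sketch of one way to write this filter:

```r
library(dplyr)

# Keep only the rows whose eye_color is exactly "brown"
brown_eyes <- starwars |>
  filter(eye_color == "brown")

brown_eyes
```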
# A tibble: 21 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Or… 150 49 brown light brown 19 fema… femin…
2 Biggs D… 183 84 black light brown 24 male mascu…
3 Han Solo 180 80 brown fair brown 29 male mascu…
4 Yoda 66 17 white green brown 896 male mascu…
5 Boba Fe… 183 78.2 black fair brown 31.5 male mascu…
6 Lando C… 177 79 black dark brown 31 male mascu…
7 Arvel C… NA NA brown fair brown NA male mascu…
8 Wicket … 88 20 brown brown brown 8 male mascu…
9 Padmé A… 185 45 brown light brown 46 fema… femin…
10 Quarsh … 183 NA black dark brown 62 male mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about… filter the characters that do not have brown eyes?
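One possible answer is to negate the comparison with != (a sketch; filter() also drops rows where the condition evaluates to NA):

```r
library(dplyr)

# Keep the rows whose eye_color is anything other than "brown"
not_brown <- starwars |>
  filter(eye_color != "brown")

not_brown
```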
# A tibble: 66 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Owen La… 178 120 brown, gr… light blue 52 male mascu…
6 Beru Wh… 165 75 brown light blue 47 fema… femin…
7 R5-D4 97 32 <NA> white, red red NA none mascu…
8 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
9 Anakin … 188 84 blond fair blue 41.9 male mascu…
10 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
# ℹ 56 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about … filter characters that have brown or blue eyes?
# A tibble: 40 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Leia Or… 150 49 brown light brown 19 fema… femin…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Chewbac… 228 112 brown unknown blue 200 male mascu…
9 Han Solo 180 80 brown fair brown 29 male mascu…
10 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Note that %in% is equivalent to chaining several == comparisons joined with the logical or (|).
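A sketch of both spellings side by side, checking that they keep the same characters:

```r
library(dplyr)

# Set membership with %in%
with_in <- starwars |>
  filter(eye_color %in% c("brown", "blue"))

# The same condition as two comparisons joined with |
with_or <- starwars |>
  filter(eye_color == "brown" | eye_color == "blue")

identical(with_in$name, with_or$name)
```

With many options, %in% stays short while the | version grows with every extra value.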
# A tibble: 40 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Leia Or… 150 49 brown light brown 19 fema… femin…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Chewbac… 228 112 brown unknown blue 200 male mascu…
9 Han Solo 180 80 brown fair brown 29 male mascu…
10 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about … filter the characters that are between 120 and 160 cm?
What type of variable is it? –> The variable height is a continuous quantitative variable so we must filter by ranges of values (intervals) –> we will use between().
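A sketch with between(), which includes both endpoints (so a height of exactly 160 is kept):

```r
library(dplyr)

# between(x, left, right) is shorthand for x >= left & x <= right
medium_height <- starwars |>
  filter(between(height, 120, 160))

medium_height
```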
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Org… 150 49 brown light brown 19 fema… femin…
2 Mon Moth… 150 NA auburn fair blue 48 fema… femin…
3 Nien Nunb 160 68 none grey black NA male mascu…
4 Watto 137 NA black blue, grey yellow NA male mascu…
5 Gasgano 122 NA none white, bl… black NA male mascu…
6 Cordé 157 NA brown light brown NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you… filter characters that have brown eyes and are not human?
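A sketch of combining two conditions with & (and). It reads the question as "brown eyes and not human", which is what the output suggests:

```r
library(dplyr)

# Both conditions must hold for a row to be kept
brown_not_human <- starwars |>
  filter(eye_color == "brown" & species != "Human")

brown_not_human
```

Rows with a missing species are dropped too, because NA != "Human" evaluates to NA rather than TRUE.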
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yoda 66 17 white green brown 896 male mascu…
2 Wicket S… 88 20 brown brown brown 8 male mascu…
3 Eeth Koth 171 NA black brown brown NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you… filter characters that have brown eyes and are not human, or are over 60 years old? Think it through: the parentheses are important: \((a+b)*c\) is not the same as \(a+(b*c)\).
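A sketch of why the grouping matters. It assumes "over 60 years old" translates to birth_year > 60 and that the eye condition refers to brown eyes, as the output suggests:

```r
library(dplyr)

# (brown eyes AND not human) OR older than 60
grouped_a <- starwars |>
  filter((eye_color == "brown" & species != "Human") | birth_year > 60)

# brown eyes AND (not human OR older than 60): a different question
grouped_b <- starwars |>
  filter(eye_color == "brown" & (species != "Human" | birth_year > 60))

c(nrow(grouped_a), nrow(grouped_b))
```

The second version can only return brown-eyed characters, while the first also lets in anyone older than 60 regardless of eye color.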
# A tibble: 18 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
3 Chewbac… 228 112 brown unknown blue 200 male mascu…
4 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
5 Yoda 66 17 white green brown 896 male mascu…
6 Palpati… 170 75 grey pale yellow 82 male mascu…
7 Wicket … 88 20 brown brown brown 8 male mascu…
8 Qui-Gon… 193 89 brown fair blue 92 male mascu…
9 Finis V… 170 NA blond fair blue 91 male mascu…
10 Quarsh … 183 NA black dark brown 62 male mascu…
11 Shmi Sk… 163 NA black fair brown 72 fema… femin…
12 Mace Wi… 188 84 none dark brown 72 male mascu…
13 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
14 Eeth Ko… 171 NA black brown brown NA male mascu…
15 Cliegg … 183 NA brown fair blue 82 male mascu…
16 Dooku 193 80 white fair brown 102 male mascu…
17 Bail Pr… 191 NA black tan brown 67 male mascu…
18 Jango F… 183 79 black tan brown 66 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
There is a special filter for one of the most common operations in data cleaning: removing missing values. For this we can use is.na() inside filter(), which returns TRUE/FALSE depending on whether each value is missing, or …
Use drop_na() (from {tidyr}): if we do not specify a variable, it removes records with a missing value in any variable. Later on we will see how to impute those missing values.
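A sketch of both approaches on a small selection of columns (the choice of columns is only for illustration):

```r
library(dplyr)
library(tidyr)  # drop_na() lives in {tidyr}

small <- starwars |> select(name, mass, height, hair_color)

# Keep the rows where mass is observed
with_mass <- small |> filter(!is.na(mass))

# Drop rows with a missing value in one variable, or in any variable
no_na_mass <- small |> drop_na(mass)
no_na_any  <- small |> drop_na()
```

filter(!is.na(mass)) and drop_na(mass) keep the same rows; drop_na() without arguments is stricter.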
# A tibble: 7 × 4
name mass height hair_color
<chr> <dbl> <int> <chr>
1 Luke Skywalker 77 172 blond
2 C-3PO 75 167 <NA>
3 R2-D2 32 96 <NA>
4 Darth Vader 136 202 none
5 Leia Organa 49 150 brown
6 Owen Lars 120 178 brown, grey
7 Beru Whitesun Lars 75 165 brown
# A tibble: 7 × 4
name mass height hair_color
<chr> <dbl> <int> <chr>
1 Luke Skywalker 77 172 blond
2 Darth Vader 136 202 none
3 Leia Organa 49 150 brown
4 Owen Lars 120 178 brown, grey
5 Beru Whitesun Lars 75 165 brown
6 Biggs Darklighter 84 183 black
7 Obi-Wan Kenobi 77 182 auburn, white
Try to perform the following exercises without looking at the solutions
📝 Select from the starwars set only those characters that are droids or whose species value is unknown.
📝 Select from the starwars set only the characters whose weight is between 65 and 90 kg.
📝 After removing missing values in all variables, select from the starwars set only the characters that are human and come from Tatooine.
📝 Select from the original starwars set non-human characters, male in sex and measuring between 120 and 170 cm, or characters with brown or red eyes.
📝 Look up the help for the str_detect(string, pattern) function of the {stringr} package (loaded with {tidyverse}). Tip: test the functions you are going to use on a small test vector beforehand so that you can check how they work. Once you know what it does, keep only those characters with the last name Skywalker. Think about the differences between str_detect() and contains().
📝 Keep only characters who have a height between 160 and 190 cm and have a mass between 50 and 90 kg and are not droids. How many characters satisfy all three conditions? Are they mostly Human or not?
📝 Keep only characters who belong to one of the following species ("Human", "Droid", "Wookiee") and whose homeworld is either "Tatooine" or "Naboo". Are there any Wookiees from Naboo ?
📝 Keep characters who satisfy at least one of the following:
"red" or "yellow"
BUT exclude characters whose gender is "none".
Up to now all operations performed (even if we used column info) were by rows. In the case of columns, the simplest action is to select variables by name with select(), giving as arguments the column names without quotes.
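A sketch of the simplest column selection, with unquoted column names:

```r
library(dplyr)

# Keep only two columns; all rows are preserved
name_hair <- starwars |>
  select(name, hair_color)

name_hair
```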
# A tibble: 87 × 2
name hair_color
<chr> <chr>
1 Luke Skywalker blond
2 C-3PO <NA>
3 R2-D2 <NA>
4 Darth Vader none
5 Leia Organa brown
6 Owen Lars brown, grey
7 Beru Whitesun Lars brown
8 R5-D4 <NA>
9 Biggs Darklighter black
10 Obi-Wan Kenobi auburn, white
# ℹ 77 more rows
The select() function allows us to select several variables at once, including concatenating their names as if they were numerical indexes with :
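A sketch of a range selection with : (the slice(1:4) at the end only shortens the printout):

```r
library(dplyr)

# All columns from name to eye_color, in the order they appear in the table
first_cols <- starwars |>
  select(name:eye_color) |>
  slice(1:4)

first_cols
```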
# A tibble: 4 × 6
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, blue red
4 Darth Vader 202 136 none white yellow
And we can unselect columns with - in front of it (reminder: - for names/indexes and ! for logical values)
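A sketch of removing columns with -. Splitting it into two select() calls is just one way of writing it:

```r
library(dplyr)

# Start from a range of columns, then drop two of them
without_some <- starwars |>
  select(name:eye_color) |>
  select(-mass, -eye_color) |>
  slice(1:4)

without_some
```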
# A tibble: 4 × 4
name height hair_color skin_color
<chr> <int> <chr> <chr>
1 Luke Skywalker 172 blond fair
2 C-3PO 167 <NA> gold
3 R2-D2 96 <NA> white, blue
4 Darth Vader 202 none white
We also have selection helpers: everything() refers to all the (remaining) variables…
# A tibble: 4 × 14
mass homeworld name height hair_color skin_color eye_color birth_year sex
<dbl> <chr> <chr> <int> <chr> <chr> <chr> <dbl> <chr>
1 77 Tatooine Luke … 172 blond fair blue 19 male
2 75 Tatooine C-3PO 167 <NA> gold yellow 112 none
3 32 Naboo R2-D2 96 <NA> white, bl… red 33 none
4 136 Tatooine Darth… 202 none white yellow 41.9 male
# ℹ 5 more variables: gender <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
…and last_col() to refer to last column.
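A sketch of both helpers. everything() is handy to move a few columns to the front, and last_col() avoids having to know the name of the final column:

```r
library(dplyr)

# mass and homeworld first, then all the remaining columns
reordered <- starwars |>
  select(mass, homeworld, everything())

# A range of columns, one extra column, and whichever column comes last
with_last <- starwars |>
  select(name:mass, homeworld, last_col())

names(with_last)
```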
# A tibble: 4 × 5
name height mass homeworld starships
<chr> <int> <dbl> <chr> <list>
1 Luke Skywalker 172 77 Tatooine <chr [2]>
2 C-3PO 167 75 Tatooine <chr [0]>
3 R2-D2 96 32 Naboo <chr [0]>
4 Darth Vader 202 136 Tatooine <chr [1]>
We can also play with patterns in the names: those that begin with a prefix (starts_with()), end with a suffix (ends_with()), contain a text (contains()) or match a regular expression (matches()).
# variables whose name ends in "color" or matches "sex" or "gender"
starwars |> select(ends_with("color"), matches("sex|gender"))
# A tibble: 87 × 5
hair_color skin_color eye_color sex gender
<chr> <chr> <chr> <chr> <chr>
1 blond fair blue male masculine
2 <NA> gold yellow none masculine
3 <NA> white, blue red none masculine
4 none white yellow male masculine
5 brown light brown female feminine
6 brown, grey light blue male masculine
7 brown light blue female feminine
8 <NA> white, red red none masculine
9 black light brown male masculine
10 auburn, white fair blue-gray male masculine
# ℹ 77 more rows
We can even select by numeric range if we have variables with a prefix and numbers.
# A tibble: 3 × 6
wk1 wk2 wk3 wk4 wk5 wk6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 115 7 95 11 NA 21
2 141 NA 162 19 262 15
3 232 17 NA 15 190 23
With num_range() we can select with a prefix and a numeric sequence.
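A sketch with a small made-up table (the weekly data frame and its values are hypothetical, not the data shown above):

```r
library(dplyr)

# Hypothetical weekly measurements; values are invented for illustration
weekly <- data.frame(
  id   = 1:3,
  wk1  = c(5, 3, 9),
  wk2  = c(4, NA, 8),
  wk3  = c(6, 2, NA),
  wk10 = c(1, 1, 1)
)

# Prefix "wk" followed by the numbers 1 and 2: selects wk1 and wk2 only
first_weeks <- weekly |>
  select(num_range("wk", 1:2))

names(first_weeks)
```

Unlike starts_with("wk"), this does not pick up wk3 or wk10.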
Finally, we can select columns by data type using where(), passing inside it a function that returns a logical value for each column according to its type.
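A sketch that keeps numeric columns first and text columns after them, which drops the list columns:

```r
library(dplyr)

# where() applies the function to each column and keeps those returning TRUE
num_and_text <- starwars |>
  select(where(is.numeric), where(is.character))

names(num_and_text)
```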
# A tibble: 87 × 11
height mass birth_year name hair_color skin_color eye_color sex gender
<int> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 172 77 19 Luke Sk… blond fair blue male mascu…
2 167 75 112 C-3PO <NA> gold yellow none mascu…
3 96 32 33 R2-D2 <NA> white, bl… red none mascu…
4 202 136 41.9 Darth V… none white yellow male mascu…
5 150 49 19 Leia Or… brown light brown fema… femin…
6 178 120 52 Owen La… brown, gr… light blue male mascu…
7 165 75 47 Beru Wh… brown light blue fema… femin…
8 97 32 NA R5-D4 <NA> white, red red none mascu…
9 183 84 24 Biggs D… black light brown male mascu…
10 182 77 57 Obi-Wan… auburn, w… fair blue-gray male mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
Try to perform the following exercises without looking at the solutions
📝 Filter the set of characters and keep only those that do not have missing data in the height variable. With the data obtained from the previous filter, select only the variables name and height, as well as all those variables that CONTAIN the word color in their name.
📝 From the original data set, select just the character (string) columns. After that, keep only the individuals whose eye color contains the word blue. Check the function str_detect(string, pattern) from the {stringr} package (already included in {tidyverse}). Think about the differences between str_detect() and contains().
📝 Using the starwars dataset, keep only characters who belong to the human species and have a height greater than 180 cm. After that, remove observations with missing values in both previous variables. After that, select only the following variables: name, homeworld and all numeric variables.
📝 Using the starwars dataset, keep only characters who are female and not human. After that, remove observations with missing values in homeworld, species and mass. Which species appear in the result? From which planets do they come? To answer this, check the function distinct() (included in {dplyr}; try to work out how to use it).
Sometimes we may be interested in performing a non-random discretionary sampling, or in other words, in filtering by position: with slice(positions) we can select specific rows by passing an index vector as argument.
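A sketch of selecting rows by position, first a single row and then several (the positions are chosen arbitrarily):

```r
library(dplyr)

# First row only
first_row <- starwars |> slice(1)

# Several specific positions at once
chosen <- starwars |> slice(c(2, 7, 10, 31))

chosen$name
```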
# A tibble: 1 × 4
name height mass hair_color
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 blond
# A tibble: 4 × 8
name height mass hair_color skin_color eye_color birth_year sex
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none
2 Beru Whitesun L… 165 75 brown light blue 47 fema…
3 Obi-Wan Kenobi 182 77 auburn, w… fair blue-gray 57 male
4 Qui-Gon Jinn 193 89 brown fair blue 92 male
We have default options:
- With slice_head(n = ...) and slice_tail(n = ...) we can get the head and tail of the table.
- With slice_max() and slice_min() we get the rows with the largest/smallest value of a variable, which we indicate in order_by = ... (if there is a tie, all tied rows are returned unless with_ties = FALSE).
# A tibble: 2 × 4
name height mass hair_color
<chr> <int> <dbl> <chr>
1 Ratts Tyerel 79 15 none
2 Yoda 66 17 white
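A sketch of these four helpers. The table just above looks like the result of something similar to the slice_min() call here, although the original code is not shown:

```r
library(dplyr)

# First and last rows of the table
top3    <- starwars |> slice_head(n = 3)
bottom3 <- starwars |> slice_tail(n = 3)

# The two lightest characters (ties would all be kept by default)
lightest <- starwars |>
  slice_min(mass, n = 2) |>
  select(name:hair_color)

# The single tallest character, breaking ties arbitrarily
tallest <- starwars |>
  slice_max(height, n = 1, with_ties = FALSE)
```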
The so-called simple random sampling is based on selecting individuals randomly, so that each one has certain probabilities of being selected. With slice_sample(n = ...) we can randomly extract n (a priori equiprobable) records.
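A sketch of drawing two rows at random. The seed value is arbitrary and is set only so that the draw can be repeated:

```r
library(dplyr)

set.seed(1234)  # arbitrary seed, for reproducibility only

# Two rows, each row equally likely, without replacement
two_random <- starwars |>
  slice_sample(n = 2)

two_random$name
```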
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 San Hill 191 NA none grey gold NA male mascu…
2 Anakin S… 188 84 blond fair blue 41.9 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Important
“Random” does not imply equiprobable: a normal die is just as random as a trick die. There are no things “more random” than others, they simply have different underlying probability laws.
We can also indicate the proportion of data to sample (instead of the number) and if we want it to be with replacement (that can be repeated).
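A sketch using prop and replace. The 5% figure is an assumption chosen for illustration:

```r
library(dplyr)

set.seed(1234)  # arbitrary seed, for reproducibility only

# Roughly 5% of the rows, where the same row may be drawn more than once
sampled <- starwars |>
  slice_sample(prop = 0.05, replace = TRUE)

nrow(sampled)
```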
# A tibble: 4 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Shaak Ti 178 57 none red, blue… black NA fema… femin…
2 Captain … NA NA none none unknown NA fema… femin…
3 Watto 137 NA black blue, grey yellow NA male mascu…
4 Finn NA NA black dark dark NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
As we said, “random” is not the same as “equiprobable”, so we can pass a probability vector. For example, let’s force that it is very improbable to draw a row other than the first two rows
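A sketch of weighted sampling with weight_by. The weights below are made up: almost all of the probability is placed on the first two rows:

```r
library(dplyr)

set.seed(1234)  # arbitrary seed, for reproducibility only

n_rows <- nrow(starwars)

# Hypothetical weights: 0.495 for each of the first two rows,
# and the remaining 0.01 spread evenly over all the other rows
row_weights <- c(0.495, 0.495, rep(0.01 / (n_rows - 2), n_rows - 2))

weighted <- starwars |>
  slice_sample(n = 2, weight_by = row_weights)

weighted$name
```

Most runs will return the first two characters, possibly in a different order, but any other row can still appear occasionally.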
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 Luke Sky… 172 77 blond fair blue 19 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
The slice_sample() function is simply a {tidyverse} counterpart of the base R function sample(), which allows us to sample elements of a vector.
By default sample() draws values of an equiprobable random variable but, as before, we can assign a vector of probabilities (a probability mass function) with the argument prob = ....
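A sketch with a fair die and a loaded die (the loaded probabilities are invented for illustration):

```r
set.seed(1234)  # arbitrary seed, for reproducibility only

# Fair die: every face has the same probability
fair <- sample(1:6, size = 10, replace = TRUE)

# Loaded die: face 1 comes up half of the time
loaded <- sample(1:6, size = 1000, replace = TRUE,
                 prob = c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1))

# Observed proportion of ones, which should be close to 0.5
mean(loaded == 1)
```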
How would you solve the following exercise?
Suppose that seasonal flu episodes have been studied in a city. Let \(X_m\) and \(X_p\) be random variables such that \(X_m=1\) if the mother has flu, \(X_m=0\) if the mother does not have flu, \(X_p=1\) if the father has flu and \(X_p=0\) if the father does not have flu. The theoretical model associated with this type of epidemic indicates that the joint distribution is given by \(P(X_m = 1, X_p=1)=0.02\), \(P(X_m = 1, X_p=0)=0.08\), \(P(X_m = 0, X_p=1)=0.1\) and \(P(X_m = 0, X_p=0)=0.8\).
Generate a sample of size \(n = 1000\) (support "10", "01", "00" and "11") by making use of runif() and by making use of sample().
We can also order by rows according to some variable with arrange().
# A tibble: 5 × 6
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Ratts Tyerel 79 15 none grey, blue unknown
2 Yoda 66 17 white green brown
3 Wicket Systri Warrick 88 20 brown brown brown
4 R2-D2 96 32 <NA> white, blue red
5 R5-D4 97 32 <NA> white, red red
By default the order is from lowest to highest, but we can reverse it with desc().
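A sketch of both directions, sorting by mass in ascending order and by height in descending order:

```r
library(dplyr)

# Ascending order (the default); missing values go to the end
lightest_first <- starwars |>
  arrange(mass)

# Descending order with desc()
tallest_first <- starwars |>
  arrange(desc(height))

tallest_first |> select(name, height, mass) |> slice(1:5)
```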
# A tibble: 5 × 3
name height mass
<chr> <int> <dbl>
1 Yarael Poof 264 NA
2 Tarfful 234 136
3 Lama Su 229 88
4 Chewbacca 228 112
5 Roos Tarpals 224 82
Many times we will need to make sure that there are no duplicates in some variable (for example an ID number such as the Spanish DNI), and we can delete duplicate rows with distinct().
# A tibble: 5 × 1
sex
<chr>
1 male
2 none
3 female
4 hermaphroditic
5 <NA>
To keep all the columns of the table we will use .keep_all = TRUE.
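A sketch of both variants of distinct(), using the sex variable:

```r
library(dplyr)

# Only the distinct values of sex (a one-column table)
sex_values <- starwars |>
  distinct(sex)

# First row found for each value of sex, keeping every column
sex_first_rows <- starwars |>
  distinct(sex, .keep_all = TRUE)

sex_first_rows |> select(name, sex)
```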
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 Leia Org… 150 49 brown light brown 19 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Finally, we can append new observations to a table with bind_rows() (if the columns do not match, the gaps are filled with NA).
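A sketch with two tiny tables. The first pair has matching columns; the second pair (with an invented city column) shows how the gaps are filled:

```r
library(dplyr)

# Matching columns: the rows are simply stacked
people <- tibble(name = "javi", age = 33) |>
  bind_rows(tibble(name = "laura", age = 50))

# Non-matching columns: missing cells become NA
mixed <- tibble(name = "javi", age = 33) |>
  bind_rows(tibble(name = "laura", city = "Madrid"))

mixed
```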
# A tibble: 2 × 2
name age
<chr> <dbl>
1 javi 33
2 laura 50
Try to perform the following exercises without looking at the solutions
📝 Select only the characters that are human and brown-eyed, then sort them in descending height and ascending weight.
📝 Randomly draw 10 characters in such a way that the probability of each character being drawn is proportional to its weight (heavier, more likely).
📝 To find out what unique values are in the hair color, remove duplicates of the hair_color variable, first removing its missing values.
📝 Of the characters that are human and taller than 160 cm, eliminate duplicates in eye color, eliminate missing values in weight, select the 3 tallest, and sort them from highest to lowest weight. Return the table.
To facilitate the relocation of variables we have a function for it, relocate(), indicating in .after or .before behind or in front of which columns we want to move them.
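A sketch that moves species in front of name:

```r
library(dplyr)

# Move one column without touching the others
species_first <- starwars |>
  relocate(species, .before = name)

names(species_first)[1:3]
```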
# A tibble: 87 × 14
species name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Human Luke S… 172 77 blond fair blue 19 male
2 Droid C-3PO 167 75 <NA> gold yellow 112 none
3 Droid R2-D2 96 32 <NA> white, bl… red 33 none
4 Human Darth … 202 136 none white yellow 41.9 male
5 Human Leia O… 150 49 brown light brown 19 fema…
6 Human Owen L… 178 120 brown, gr… light blue 52 male
7 Human Beru W… 165 75 brown light blue 47 fema…
8 Droid R5-D4 97 32 <NA> white, red red NA none
9 Human Biggs … 183 84 black light brown 24 male
10 Human Obi-Wa… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
# vehicles <list>, starships <list>
Sometimes we may also want to modify the “meta-information” of the data, renaming columns. To do this we will use rename() by typing first the new name and then the old.
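A sketch of renaming three columns, always in the form new_name = old_name:

```r
library(dplyr)

# New name on the left, existing name on the right
renamed <- starwars |>
  rename(nombre = name, altura = height, peso = mass)

names(renamed)[1:3]
```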
# A tibble: 87 × 14
nombre altura peso hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
If you look at the output of select(), it is still a tibble: it preserves the nature of our data.
Sometimes we will not want such a structure but literally want to extract the column as a VECTOR, something we can do with pull().
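A sketch contrasting the two results, a one-column tibble versus a plain vector:

```r
library(dplyr)

# select() returns a tibble with one column
name_table <- starwars |> select(name)

# pull() returns the column itself as a character vector
name_vector <- starwars |> pull(name)

head(name_vector, 3)
```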
[1] "Luke Skywalker" "C-3PO" "R2-D2"
[4] "Darth Vader" "Leia Organa" "Owen Lars"
[7] "Beru Whitesun Lars" "R5-D4" "Biggs Darklighter"
[10] "Obi-Wan Kenobi" "Anakin Skywalker" "Wilhuff Tarkin"
[13] "Chewbacca" "Han Solo" "Greedo"
[16] "Jabba Desilijic Tiure" "Wedge Antilles" "Jek Tono Porkins"
[19] "Yoda" "Palpatine" "Boba Fett"
[22] "IG-88" "Bossk" "Lando Calrissian"
[25] "Lobot" "Ackbar" "Mon Mothma"
[28] "Arvel Crynyd" "Wicket Systri Warrick" "Nien Nunb"
[31] "Qui-Gon Jinn" "Nute Gunray" "Finis Valorum"
[34] "Padmé Amidala" "Jar Jar Binks" "Roos Tarpals"
[37] "Rugor Nass" "Ric Olié" "Watto"
[40] "Sebulba" "Quarsh Panaka" "Shmi Skywalker"
[43] "Darth Maul" "Bib Fortuna" "Ayla Secura"
[46] "Ratts Tyerel" "Dud Bolt" "Gasgano"
[49] "Ben Quadinaros" "Mace Windu" "Ki-Adi-Mundi"
[52] "Kit Fisto" "Eeth Koth" "Adi Gallia"
[55] "Saesee Tiin" "Yarael Poof" "Plo Koon"
[58] "Mas Amedda" "Gregar Typho" "Cordé"
[61] "Cliegg Lars" "Poggle the Lesser" "Luminara Unduli"
[64] "Barriss Offee" "Dormé" "Dooku"
[67] "Bail Prestor Organa" "Jango Fett" "Zam Wesell"
[70] "Dexter Jettster" "Lama Su" "Taun We"
[73] "Jocasta Nu" "R4-P17" "Wat Tambor"
[76] "San Hill" "Shaak Ti" "Grievous"
[79] "Tarfful" "Raymus Antilles" "Sly Moore"
[82] "Tion Medon" "Finn" "Rey"
[85] "Poe Dameron" "BB8" "Captain Phasma"
Try to perform the following exercises without looking at the solutions
📝 Translate the names of the columns into Spanish.
📝 Place the hair color variable just after the name variable.
📝 Check how many unique modalities there are in the hair color variable (without using unique()).
📝 From the original data set, remove the list type columns, and then remove duplicates in the eye_color variable. After removing duplicates, extract that column into a vector.
📝 From the original starwars dataset, keeping only the characters whose height is known, extract that variable as a vector.
📝 After obtaining the vector from the previous Exercise, use this vector to randomly sample 50% of the data so that the probability of each character being chosen is inversely proportional to their height (shorter, more options).
In many occasions we will want to modify or create variables with mutate().
Let’s create for example a new variable height_m with the height in meters.
# A tibble: 87 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_m <dbl>
In addition with the optional arguments we can reposition the modified column
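A sketch of creating two variables in one mutate() call and placing them in front of name. The second variable uses the first one, which is allowed because the arguments are evaluated in order:

```r
library(dplyr)

# height is in cm, so divide by 100; BMI = mass (kg) / height (m)^2
with_bmi <- starwars |>
  mutate(height_m = height / 100,
         BMI = mass / height_m^2,
         .before = name)

with_bmi |> select(height_m, BMI, name) |> slice(1:3)
```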
# A tibble: 87 × 16
height_m BMI name height mass hair_color skin_color eye_color birth_year
<dbl> <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
1 1.72 26.0 Luke … 172 77 blond fair blue 19
2 1.67 26.9 C-3PO 167 75 <NA> gold yellow 112
3 0.96 34.7 R2-D2 96 32 <NA> white, bl… red 33
4 2.02 33.3 Darth… 202 136 none white yellow 41.9
5 1.5 21.8 Leia … 150 49 brown light brown 19
6 1.78 37.9 Owen … 178 120 brown, gr… light blue 52
7 1.65 27.5 Beru … 165 75 brown light blue 47
8 0.97 34.0 R5-D4 97 32 <NA> white, red red NA
9 1.83 25.1 Biggs… 183 84 black light brown 24
10 1.82 23.2 Obi-W… 182 77 auburn, w… fair blue-gray 57
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Important
When we apply mutate(), we must remember that the operations are vectorized (element by element), so the expression we use inside must return a vector with as many elements as there are rows. If it returns a single value, that value is recycled into a constant column; any other length produces an error.
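A sketch of the recycling behaviour: mean() returns a single number, so every row receives the same value (the column name constante follows the output shown in the slides):

```r
library(dplyr)

# mean() collapses the column into one value, which is repeated in every row
with_constant <- starwars |>
  mutate(constante = mean(mass, na.rm = TRUE), .before = name)

with_constant |> select(constante, name, mass) |> slice(1:3)
```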
# A tibble: 87 × 15
constante name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 97.3 Luke… 172 77 blond fair blue 19 male
2 97.3 C-3PO 167 75 <NA> gold yellow 112 none
3 97.3 R2-D2 96 32 <NA> white, bl… red 33 none
4 97.3 Dart… 202 136 none white yellow 41.9 male
5 97.3 Leia… 150 49 brown light brown 19 fema…
6 97.3 Owen… 178 120 brown, gr… light blue 52 male
7 97.3 Beru… 165 75 brown light blue 47 fema…
8 97.3 R5-D4 97 32 <NA> white, red red NA none
9 97.3 Bigg… 183 84 black light brown 24 male
10 97.3 Obi-… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
We can also combine mutate() with the if_else() control expression to recategorize the variable: if a condition is met, it does one thing, otherwise another.
starwars |>
  mutate(human = if_else(species == "Human", "Human", "Not Human"),
         .after = name) |>
  select(name:mass)
# A tibble: 87 × 4
name human height mass
<chr> <chr> <int> <dbl>
1 Luke Skywalker Human 172 77
2 C-3PO Not Human 167 75
3 R2-D2 Not Human 96 32
4 Darth Vader Human 202 136
5 Leia Organa Human 150 49
6 Owen Lars Human 178 120
7 Beru Whitesun Lars Human 165 75
8 R5-D4 Not Human 97 32
9 Biggs Darklighter Human 183 84
10 Obi-Wan Kenobi Human 182 77
# ℹ 77 more rows
For more complex categorizations we have case_when(), for example, to create a category of characters based on their height.
starwars |>
  drop_na(height) |>
  mutate(altura = case_when(height < 120 ~ "dwarf",
                            height < 160 ~ "short",
                            height < 180 ~ "normal",
                            height < 200 ~ "tall",
                            TRUE ~ "giant"), .before = name)
# A tibble: 81 × 15
altura name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 normal Luke Sk… 172 77 blond fair blue 19 male
2 normal C-3PO 167 75 <NA> gold yellow 112 none
3 dwarf R2-D2 96 32 <NA> white, bl… red 33 none
4 giant Darth V… 202 136 none white yellow 41.9 male
5 short Leia Or… 150 49 brown light brown 19 fema…
6 normal Owen La… 178 120 brown, gr… light blue 52 male
7 normal Beru Wh… 165 75 brown light blue 47 fema…
8 dwarf R5-D4 97 32 <NA> white, red red NA none
9 tall Biggs D… 183 84 black light brown 24 male
10 tall Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 71 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Try to perform the following exercises without looking at the solutions
📝 Select only the variables name and height, as well as all those variables related to color, while keeping only the rows with no missing value in height.
📝 With the data obtained from the previous Exercise, translate the names of the columns into Spanish or your mother language.
📝 With the data obtained from the previous Exercise, place the hair color variable just after the name variable.
📝 With the original data, check how many unique modalities there are in the hair color variable.
📝 From the original dataset, select only the numeric and text variables. Then define a new variable called under_18 to recategorize the age variable: TRUE if under age and FALSE if not.
📝 From the original dataset, create a new column named auburn that tells us TRUE if the hair color contains that word and FALSE otherwise (reminder str_detect()).
📝 From the original dataset, include a column that calculates BMI. After that, create a new variable that takes the value NA if the character is not human, underweight if BMI is below 18, normal between 18 and 30, and overweight above 30.
We will analyse Taylor Swift songs from the {taylor} package (you need to install it first).
# A tibble: 240 × 29
album_name ep album_release track_number track_name artist featuring
<chr> <lgl> <date> <int> <chr> <chr> <chr>
1 Taylor Swift FALSE 2006-10-24 1 Tim McGraw Taylo… <NA>
2 Taylor Swift FALSE 2006-10-24 2 Picture To Bu… Taylo… <NA>
3 Taylor Swift FALSE 2006-10-24 3 Teardrops On … Taylo… <NA>
4 Taylor Swift FALSE 2006-10-24 4 A Place In Th… Taylo… <NA>
5 Taylor Swift FALSE 2006-10-24 5 Cold As You Taylo… <NA>
6 Taylor Swift FALSE 2006-10-24 6 The Outside Taylo… <NA>
7 Taylor Swift FALSE 2006-10-24 7 Tied Together… Taylo… <NA>
8 Taylor Swift FALSE 2006-10-24 8 Stay Beautiful Taylo… <NA>
9 Taylor Swift FALSE 2006-10-24 9 Should've Sai… Taylo… <NA>
10 Taylor Swift FALSE 2006-10-24 10 Mary's Song (… Taylo… <NA>
# ℹ 230 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
# single_release <date>, track_release <date>, danceability <dbl>,
# energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
# key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>
Try to answer the questions posed in the workbook intro-tidyverse
To practice some {dplyr} functions we are going to use data from the Lord of the Rings movie trilogy. We will load the data directly from the web (GitHub in this case), without downloading it to the computer first, simply giving as the path the URL where the file is hosted.
The Fellowship of the Ring -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv
The Two Towers -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv
The Return of the King -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv
library(readr)
lotr_1 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
lotr_2 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
lotr_3 <-
  read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
Try to answer the questions posed in the workbook intro-tidyverse
Javier Álvarez Liébana • Course at Real Colegio Complutense at Harvard University